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ABSTRACT 


Paper documents can be converted into digital form, as a 
Sorecelonewor 1mages, Or a Combination Of ASCII text and 
images. Full text and image document databases, display 
advantages and disadvantages during scanning and conversion 
processes. Conversion of paper thesis documents could be 
eliminated, if thesis documents could be submitted in 
Gigital form, for storage on optical disks. 

Utilizing existing paper thesis documents, image and 
full text databases were developed and e\ luated to 
determine the best digital form for storage of paper 
documents. Analysis was performed on a thesis document in 
digital form, to determine the most feasible format for 
digital document submission. 

This thesis concludes that conversion of paper documents 
to digital form should not be pursued. Instead, thesis 
documents should be submitted in digital form for direct 
conversion and storage on optical disks. Follow on thesis 
research is recommended to build an in-house CD-ROM 


mastering system for this purpose. 
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I. INTRODUCTION 


A. DISCUSSION 

The storage of thesis documents, at the Knox Library 
aboard Naval Postgraduate School, requires a great deal of 
space. These documents can be converted to digital forn, 
for storage on optical disks, using optical scanners and 
optical disk publishing software. Documents can be 
digitized and stored, using one of two fundamental forms; as 
bit mapped images, or as full text documeits with images of 
graphics. Each form provides advantages .nd disadvantages, 
with respect to document conversion and retrieval. A CD-ROM 
optical disk will hold approximately 1200 thesis documents 
fe 5000 to 207 >500 pages). Since information stored on 
optical disks, is only retrievable using a computer system, 
it is vitally important that, different topical areas within 
this information, be accessible, without spending a lot of 
time or energy. 

The advent of word processors for computers, means that, 
thesis documents exist in digital form during their 
formulation and completion. It would be advantageous to 
receive completed thesis documents in digital form, for 


direct storage on optical disks. 


This thesis analyzes the advantages and disadvantages of 
storage and retrieval of thesis documents in both image and 
full text form. It also analyzes the benefits and drawbacks 


of submitting thesis documents in electronic form. 


B. SCOPE 

This thesis includes an in-depth analysis of an image, 
and full text database. The purpose of this analysis is to 
provide an accurate determination of the advantages and 
Gisadvantages of each digitized form, for conversion and 
storage of existing documents on optical disks. 

Analysis of requirements for submission of completed 
thesis documents in electronic form is also conducted, anda 


recommendation for possible implementation is made. 


C. METHODOLOGY 

Utilizing thesis documents presently stored at Knox 
Library, an image and full text database was developed by 
optically scanning documents, for image generation and 
optical character recognition. Optical disk publishing 
software was used to link images of graphics with full text 
of documents, and perform markup and indexing of documents, 
for searching and retrieval. Each database was evaluated to 
determine which provided the greatest benefits to the user 


(searcher). 


Utilizing a digital copy (text with separate images of 
graphics), of a completed thesis, an experiment was 
conducted to determine the requirements for turning in 
thesis documents in electronic form. This experiment 
focused on the advantages and disadvantages pertaining to a 
digital document in word processor format; WordPerfect 5.1, 


versus a document in ASCII text format. 


II. OPTICAL TECHNOLOGY 


A. INTRODUCTION 

The generation of a digital database for this thesis, 
and the retrieval of that information, primarily deals with 
two areas of optical technology: Optical Scanners and 
Compact Disk Read Only Memory (CD-ROM). This section will 
discuss in general, the technologies associated with 
scanners: their basic operation, optical character 
recognition (OCR), and image scanning. The basic operation, 
structure, advantages, and disadvantages of CD-ROM are 


discussed. 


B. OPTICAL SCANNERS 

Optical scanners convert hard copy documents into a 
digital format. This is accomplished by converting the 
document into a bit mapped image or into text using OCR. 
The following sections will describe in general: the basic 
theory of how a scanner works, the different characteristics 
of a scanned image, and optical character recognition. 

1. Scanner Technology 

There are three basic types of optical scanners: 

moving paper scanners, flat bed scanners, and electronic 
digitizing camera scanners. The primary difference among 


these scanning technologies is the method of document 


illumination (or document transport) during the scanning 
process. (Fowler and Clipper, 1991, pp. 44-46) 

The scanning process is basically accomplished by 
five components: a group of Charged Coupled Devices (CCD); a 
light source; a set of optics to force the light image onto 
the CCD; electronics to distinguish between dark (inked) 
areas, and light (white) areas; and a transport mechanism to 
move either the paper, or the scanning device depending on 
the type of scanner.' (Wageman, 1989, pp. 3.112.4 -3.112.6) 


Figure 1 portrays the interaction of components during the 


scanning process. 





Figure 1 Scanner Operation 
@uavlor, 987, p. 10) 


' A Charged Coupled Device is a light sensitive 
semiconductor which produces an electrical signal which is 
Peeceomrlenat co the light incident upon it. 


During the process of scanning, a document is 
illuminated, a strip at a time by a light source. The 
movement of the document relative to the light source is 
accomplished by the transport mechanism. The transport 
mechanism either moves the scanning device (light source) 
across the document, or moves the document across the 
scanning surface. The reflected light is directed via a set 
of optics, onto a series of CCD's, which transform the 
optical signal into an electrical signal. The electrical 
Signal is converted into binary one's and zero's, which 
represent the scanned image, created by electronics, and the 
scanner software. 

2. Scanning Images 

Unlike text in ASCII format, an image is not 
Girectly accessible. It is a two dimensional bit mapped 
representation. There are three characteristics used to 
discuss image scanning: resolution, levels of greyscale, and 
color. These characteristics of image scanning are 
discussed in the following sections. 

a. Resolution 

Resolution is a measure of the level of detail 
represented in an image. Resolution is typically discussed 


in terms of pixels, or dots per inch (dpi). Pixel, origi- 


nates from a television industry term; "picture element’." 
Dpi refers to the number of pixel dots used to represent 
that image. A resolution of 400 dpi implies that, every 
square inch of that image is represented by 160,000 pixel 
Gots (400 x 400). An image with a resolution of 400 dpi 
contains more detail and thus better quality, then an image 
with a resolution of 200 dpi. 

Image resolution is often confused with display 
resolution. Image resolution refers to the resolution of 
the stored image. Unlike display resolution, which refers 
to the number of pixels the screen can display. An image 
with a resolution less than the display w.11 only take up a 
portion of the display; where as an image with a resolution 
greater than the display will encompass an area greater then 
the display. (Apperson and Doherty, 1987, pp. 124-126) 

b. Levels of Greyscale 

The number of bits stored per pixel determines 
the amount of information that can be stored for each pixel. 
The ability to store this additional information 
distinguishes black and white images from greyscale images. 


In a black and white image, each pixel is represented by one 


¢ The detail of a television picture image is 


determined by a combination of picture elements per 
horizontal scan line, and the number of horizontal scan 
lines per picture. In the United States a picture contains 
525 scan lines with 435 picture elements per scan line. 

This equates to a resolution of 435 x 525. (Fink, 1984, pp. 
83-84) 


bit which corresponds to either black or white. Ina 
greyscale system two or more bits are stored for each pixel. 
The number of grey levels which can be represented by each 
pixel is determined by two raised to the power of the number 
of bits stored per pixel (2%). For example ina four bit 


per pixel system, each pixel could be one of 16 shades of 


grey. 
c. Digitizing Color Images 
The representation of color images is very 
Similar to that of greyscale images. Most color scanners on 


the market today produce color images by scanning a document 
three times in succession; once for each of the primary 
colors: red, green, and blue. Each of the three scans uses 
as its light source, one of three mercury vapor lamps which 
produce red, green, or blue light. Each of the three scans 
produces a color bit map, which is later merged to produce 
the color image. The number of bits stored for each of the 
primary colors, determines the displayable colors. Storing 
four bits per primary would allow 4096 colors (2% x 2% x 2% = 
4096). (Apperson and Doherty, 1987, pp. 143-146) 
3. Optical Character Recognition 

Optical Character Recognition (OCR), or Intelligent 
Character Recognition (ICR), is the process by which the 
text (alphanumeric characters) of a scanned image is 


converted, from a bit mapped representation into ASCII 


characters. There are two primary methods used in this 
process; matrix matching and feature matching. 

a. Matrix Matching 

The process of matrix matching breaks the image 

into a matrix of blocks, corresponding to character 
locations. These blocks are subdivided into a matrix of 
pixels. The pixel matrix of the character block, is then 
compared with those of the standard alphanumeric characters, 
until a match is found that satisfies a preprogrammed level 
of confidence. 

b. Feature Matching 

The process of feature matching is similar to 

matrix matching, as the character locations are broken down 
into a matrix of pixels. In feature matching, the 
comparison is not of a whole pixel pattern, but of the 
distinct features of the pattern. These features include 
vertical, horizontal and diagonal lines, as well as loops. 
An illustration of feature comparison is shown in Figure 2. 
The determination of a correct match is again based ona 
preprogrammed level of confidence. 

For a more detailed explanation of OCR see Taylor, 


1987, pp. 12-17. 


FEATURE ANALYSIS IN OCR TECHNOLOGY 
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Figure 2 Feature Matching (Wageman, 1989, p. 
IAB 0s) 


C. COMPACT DISK READ ONLY MEMORY 
1. Basic Concepts 

Compact Disk Read Only Memory (CD-ROM) is an 
internationally accepted mass storage media. The "read 
only" capability of CD-ROM makes it ideally suited for the 
distribution of large amounts of unchanging data. 

The world wide acceptance of CD-ROM as a reliable 
mass storage media is a result of its standardization. The 
physical and logical requirements of CD-ROM are established 
by the specifications in International Standards 


Organization (ISO) 9660, commonly referred to as "High 


ins 


Sierra Group." This standardization ensures that a CD-ROM 
disc, manufactured in accordance with the ISO 9660 standard 
can be played on any CD-ROM drive meeting the same standard. 
2. CLV vs CAV 

A CD-ROM drive is a constant linear velocity (CLV) 
device, rather than a constant angular velocity (CAV) device 
used for most magnetic drives. This means that the rotation 
speed of the disc varies, depending on the location of the 
read head. As the read head moves from the inner to the 
outer edge of the disc, the speed of rotation would decrease 
to maintain the disc area under the read head, at a constant 
velocity relative to the read head. This implies that, the 
density of the data written on the disk remains constant 
throughout the entire disk. 

3. Physical Structure 

The physical structure of a CD-ROM disc can be 
characterized by two distinct characteristics: microscopic 
pits and lands, and a continuous spiral. As shown in Figure 
3, data is represented by a series of microscopic lands and 
pits. This data is read by measuring the reflected light 
from a laser beam as it passes over the pattern of 
microscopic pits and lands. As the laser passes from land 
to pit, or from pit to land, the amount of reflected lighted 


light changes; representing a binary one. 


aL 


0010000100000000100100100001001001000000 


reflective 
aluminum 
coating 


MM PDC SEZ pr IOI: 4 


Pit: microscopic depression in the 
reflective surface of a CD-ROM 


Land: raised and reflective surface 
between 2 adjacent pits. 


Binary 1: the transition from pit to land 
and land to pit. 





Figure 3 Microscopic Structure of CD-ROM Disc 
(Hoge, 1989, p.54) 
These microscopic pits are arranged in a continuous 

Spiral, as depicted in Figure 4. The density of this spiral 
is 16,000 tracks per inch, providing a total storage length 
of approximately three miles. The continuous spiral of a 
CD-ROM disc is divided into 270,000 sectors, that are 
referenced in terms of minutes, seconds, and sector.? 


There are 2352 bytes per sector; 2048 bytes of which are 


useable. This equates to a useable storage capacity of 


> Each CD-ROM disc contains 60 minutes of play. Each 
minute contains 60 seconds. There are 75 sectors per 
second. 60 minutes/disc x 60 seconds/minute x 75 
sectors/second = 270,000 sectors/disk. 


a2 


540,000 kilobytes per CD-ROM disk (270,000 sectors/disk x 2 
kilobytes/sector). The utilization of bytes per sector is 


Slovwietieiaabe 1. (Fowler ana Clipper, 1991, pp. 36-38) 





Vig 
Mii 
WW 
WY 
Constant Angular Velocity (CAV) Constant Linea. Velocity (CLV) 
Concentric Tracks Spiral Track 
Figure 4 CLV vs CAV (Fowler and Clipper, 1991, p. 
37) 


TABLE 1. BYTE UTILIZATION PER SECTOR 


Synchronization data 
Header data 

User data 

Error Detection data 
Unused data 


pmol Correction data 


2052 bytes 


LS 





4. Advantages and Disadvantages of CD-ROM 


If thesis documents were available in digitized 


form, what would be the advantages and disadvantages of 


storing this information on CD-ROM? 


ile 


on 


a 


Compact Storage: No other media provides the storage 
Capacity for such a vast amount of information in such 
a compact size. 


Excellent File Integrity: The read only nature of CD- 
ROM alleviates the problem of inadvertent modification 
or erasure. The estimated 50 year life of CD-ROM 
ensures reliable file integrity when used for archival 
of data. (Arnold, July 71991, pp. 40) There iS fio risk 
of head crashes resulting in lost data, like those 
capable on magnetic drives. 


Extremely Cost Effective: The information storage cost 
using CD-ROM is 4 cents/MB, compared to 90 cents/MB 
using floppy disks, 50 cents/MB using WORM and $4.00/MB 
using winchester drives. The only media which is 
currently cheaper than CD-ROM is Digital Paper at .006 
cents/MB. The draw back of Digital paper is that it is 
an emerging technology with very expensive drives. 
(Arnold, July 1991, pp. 40-44) 


Slow Access Times: The only real disadvantage of CD- 
ROM, is its relatively slow access times, compared to 
Winchester and floppy disk drives. The access time for 
a Winchester or floppy disk drive is more than 10 times 
faster than a CD-ROM disc drive. 


14 


III. INFORMATION RETRIEVAL 


A. INTRODUCTION 

The electronic revolution has enabled the storage of 
vast amounts of information (in the form of documents), on 
digital media. Information in electronic form, unlike 
information on paper, can only be retrieved electronically 
via a computer system. Consequently, our ability to 
retrieve information from an electronic document database, 
is based on our ability to properly index the information, 
so it can be located effectively and efficiently by the 
user. A database with documents well indexed, can provide 
for effective retrieval of information, thus allowing for 
tremendous storage and time saving benefits, of electronic 
information, to be fully realized. 

The following sections discuss; differences between 
retrieving documents and retrieving data, the problems 
associated with document retrieval, some measures of 
retrieval effectiveness, browsing, and the different methods 


of searching for documents. 


tS 


B. DOCUMENT RETRIEVAL VS DATA RETRIEVAL 

Retrieval of data is much different than retrieval of 
documents. A study conducted in 1984, David C. Blair points 
out four fundamental ways document retrieval differs from 


Gata Getrrevale 


- "the way queries are answered." 


- "the relation between the formal system request, and 
user satisfaction." 


- "the criterion for successful retrieval." 
- "the factors that influence speed." (Blair, 1984, pp. 
369-370). 
These differences are discussed below: 
1. Response to a Query 
Data retrieval systems return the data, which is the 
direct answer to the guestion. For example, if asked for 
Joe Smith's address, the system would return the data that 
was Joe Smith's address (if it was in the database). 
Document retrieval systems require that a document(s) be 
located by matching key words, phrases, or key fields 
contained within these documents. This implies that, an 
answer to a document retrieval system guery will not be the 
data directly, but a list of documents likely to contain the 
answer. The user must then review the documents to 


determine if they contain the required information. 


als) 


2. Question/Answer Relationship 
A document search for key words or phrases provides 

the user with a list of documents that have a high 
probability of containing the required information. This is 
based on the user's perception of what words or phrases are 
best used to match a subject area to a document search. 
This highly probabilistic process, depends greatly on the 
user's prediction of how information is presented in the 
documents. Conversely, data retrieval systems are based on 
logical relationships between a guestion and an answer. The 
system determines an answer based on a question. If asked 
for the telephone number of Joe Smith, the system will 
locate and provide the telephone number of Joe Smith, all 
other answers would be incorrect. 

3. Utility - The Criterion for Successful Retrieval 

Data retrieval systems provide either, an 

appropriate answer, or no data at all (i.e. the desired 
information is not in the database). Success is based on 
correctness. Document retrieval systems provide a set of 
documents, that must be reviewed to determine usefulness. 
The measure of success, is not correctness, but how useful 
the retrieved documents are to the user; or utility. As you 
may presume, utility is a subjective measure, that varies 


Erom USemE, to user. 


7 


4. Influence on Retrieval Speed 

The specific deterministic nature of data retrieval, 
results in response times that are closely tied to system 
speed. Document retrieval is an iterative process, heavily 
dependent on the user's ability to accurately predict how 
information is presented. An accurate prediction will 
induce a query with enough specificity to isolate desired 
information while excluding most unwanted information. A 
less accurate prediction will result in a less precise 
query, with less favorable results. Consequently, the 
iterations (search/evaluate/query refinement) required to 
isolate the desired information, are an outcome of query 
formulation which directly effects overall retrieval time. 
This results in a total retrieval time that is heavily 
biased by the user, and less dependent on the system. For 
this reason, a retrieval system that provides both effective 
and efficient tools for formulating queries, will out 
perform one that does not, despite running on a faster 
system. This is particularly important in CD-ROM databases, 
whose access times are significantly slower then traditional 
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C. FULL TEXT DOCUMENT RETRIEVAL 
J. D. Fowler and B. Clipper divide electronic document 
retrieval systems into three classes: database document 


retrieval systems, full text retrieval systems, and hybrid 
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systems (Fowler & Clipper, 1991, p. 61). Our discussion 
will be limited to full text retrieval systems. For more 
information on database and hybrid document retrieval 
systems, see Fowler & Clipper, 1991, pp. 61-67. 

1. Index Formulation For Document Retrieval. 

Full text document retrieval systems save the entire 

text of the documents. The retrieval system creates an 
inverted index file of all the unigue words (minus 


“ The inverted index file is 


stopwords) in the database. 
several inter-related levels of indexes. The first level, 
is a dictionary of all unique words in th: document 
database. Each word in the dictionary ha. an associated 
occurrence list of all documents, as well as the location(s) 
within those documents, of every occurrence of that word. 
When a search for a word or phrase is conducted, the 
retrieval software performs a search of the index, in lieu 
of the entire database, providing faster retrieval times. 
(sand, 19387, pp. 84=85) 

The storage requirements associated with indexes of 
this nature range from 30 to 100 percent of the total size 


of the document database (Fand, 1987, p. 95). Therefore, an 


actual CD-ROM database is considerably smaller than the 540 


* Stopwords, are words that are not helpful in the 


retrieval of a document. Some examples are; an, by, can, 
do, for, have, if, & when. These words are not indexed, 
saving space and speeding up the retrieval process. 
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Megabyte storage space. For example, the size of the 
database could range from 415 to 270 Megabytes, which is 
equivalent to 207,500 - 135,000 pages of text, depending on 
the size of the index. 

2. Difficulties Associated with Document Retrieval 

When searching a document database the user is faced 
with the task of formulating a query that will satisfy 
prediction and futility point criteria (Blair, 19345 p- 
37 

The prediction criterion, requires the user 
accurately predict, how the information they are interested 
in, is represented, in order to formulate a query that will 
best retrieve that information. The searcher must formulate 
this query in a manner that will be exhaustive enough to 
retrieve a satisfactory number of relevant documents, yet 
specific enough to exclude retrieval of irrelevant 
documents. 

The futility point criterion, requires that in the 
process of document searching, the user will continue to 
refine his/her search until the set of documents retrieved, 
is small enough in number, that the searcher is willing to 
browse through them. The futility point for most searchers 
is from 20 to 50 documents (Blair, 1984, p. 372). Research 
conducted by McCarn and Lewis in 1990, states that the 


optimum value of information as related to retrieval size, 
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is 35 documents (McCarn and Lewis, 1990, p. 498). The user 
will therefore maximize the value of the information per 
retrieval set, if he/she limits the retrieved set to 
approximately 35 documents. 

Peucer Ss futility petme 1S not a function of 
database size. As the size of a database grows, the 
occurrence list of unigue terms in the database also 
increases. Therefore, in general, a search (using the same 
terms) of a large database will retrieve more documents than 
an equivalent search on a small database. Consequently, as 
the size of a document database Greases, so will the 
number of refinements (or search iterations) necessary to 
obtain a set of documents, less than or equal to the user's 
Poel ty) DOlNnt’. 

Search refinements or iterations usually take the 
form of search term modification, such as adding or 
subtracting terms in combination with boolean operators 
(AND, OR & NOT). This is generally very effective. One 
should use caution when refining a search query however, as 
the addition of more search terms will not always increase 
the probability of finding the information you desire. As 
the number of search terms increases, the number of 
different searchable combinations increases. Only if the 
correct combination(s) of these terms are chosen, will the 


searcher achieve better results. As an example of this 
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highly probabilistic process, assume that the information 
desired by the user is the shaded portion of Figure 5. The 
shaded portion of Figure 5 represents, an intersection of 
the documents that contain term A, with those that contain 


term C, or all the documents that contain terms A and C. 





Figure 5 Documents Desired 


A: Documents retrieved with term 
B: Documents retrieved with term 
C: Documents retrieved with term 
D: Documents retrieved with term 


UAW PY 


ASSUMPTIONS: 
A, «ByeC; oDeeeFrut lla oie 
All set intersections < Futility Point 


Ze 


Suppose the user commenced a search with term A, 
resulting in set A. Since the number of documents retrieved 
is greater than the users futility point, the query is 
modified to include terms A AND B. Results of the query are 
shown in Figure 6. 

Since the number of documents resulting from the 
query (terms A AND B) is below the users futility point, the 
search could stop here, and the desired documents would not 
be retrieved. Additional refinement of the query by adding 
term D (A AND B AND D), will result ina smaller number of 
retrieved documents, but will not retrieve the documents 
desired, as illustrated in Figure 7. 

Abandoning term B, and picking a new term C will 
result ina satisfactory query containing a portion of the 
relevant documents, as shown in Figure 8. Only by combining 
terms A AND C and abandoning terms B and D can the searcher 
retrieve all desired documents, as previously shown in 


Facgure 5. 





Figure 6 Inter- 


Section or A & B 





Figure 7 Terms A Figure 8 A AND C 
AND B AND D AND D 


22 


Understanding the capabilities as well as the 
potential pitfalls of query refinement through term 
enhancement, is increasingly important as the size of 


document databases increase. 


D. MEASURES OF DOCUMENT RETRIEVAL EFFECTIVENESS 

Document retrieval is indirect, probabilistic, time 
consuming, and has success measured by utility. These 
characteristics constrain the measurement of document 
retrieval effectiveness. Measures have not been found that 
diminish the effects of subjectivity on document retrieval, 
however accepted measures are used with these effects 
considered. 

The two most widely used measures of document retrieval 
effectiveness are, precision and recall (Blair and Maron 
1985, p. 290). Precision ratio, the fraction of retrieved 
documents which are relevant, characterizes the systems 
capability for retrieving only relevant documents. Recall 
ratio, the fraction of relevant documents retrieved, 
describes how well the system is at retrieving all of the 
documents that are relevant. The mathematical formulas for 
precision and recall are shown in Table 2. (McCarn and 


Lewis, 1990, p. 496) 
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TABLE 2. PRECISION & RECALL RATIOS 


Precision Ratio: PR L/S 


Recall Ratio: RR r/R 


total number of relevant documents in the database. 


ive 
S: number of documents retrieved by the search. 
r: number of documents relevant, of those retrieved. 





(Notation adopted from McCarn & Lewis, 1990, p. 496) 


Retrieval systems operate at various levels of recall 
and precision, by varying the specificity of the submitted 
query. The average performance for a query will move from 
low-precision/high-recall to high-precisi n/low-recall as 
the query is made more specific (Salton, 1986, p. 651). 

This relationship is shown in Figure 9. 

For example, consider a search for documents pertaining 
to "micro economics". A general search using the term 
"economics" will retrieve all documents containing 
"economics." Many of the documents that contain the term 
"economics" and not "micro economics", are probably 
irrelevant (low-precision), however those that are relevant 
would be found (high-recall). This would place the searcher 
at the low-precision/high-recall end in Figure 9. Making 
the search more specific by searching for "micro economics" 
will increase precision, as many of the irrelevant documents 
(those containing "economics" and not "micro economics") 


will not be retrieved. Those documents that were relevant 
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to "micro economics" and only contained Chevcerm cesnomics” 
will not be retrieved, which will lower recall. This moves 
the searcher in the direction of high-precision/low- 


recall.? 


Specific and Ideal 
Precision exhaustive indexing; 

narrow search iu 

formulation 


Average 


Broad indexing and 
genera) search 
Vv formulstion 





1.0 Recall 
Figure 9 Precision - Recall 
Relationship. (Salton, p. 336, 
1970) 


Traversing from low-precision/high-recall to high-pre- 
cision/low-recall is the process of satisfying the users 
futility point. Given that the user's initial query results 
in a retrieval set greater than his/her futility point, 
he/she will refine the query (1.e., make it more specific) 
to reduce the number of documents retrieved. Proper re- 
finement of a general guery will result in; a retrieval set 
of smaller size, with a higher percentage of relevant docu- 


ments, and non-retrieval of some relevant documents (Blair 


> Since 5S, <<"S, and ©. <~ © (aseeente GCCtnent> Ena 


contain only economics may be relevant to micro economics), 
PR, > PR,, and Since Ris Celotta jaan RR, - 
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and Maron, 1985, pp. 296-297). The result is moving from 


low-precision/high-recall to high-precision/low-recall. 


Precision and recall can not be used to compare 
databases of substantially different size and topical 
specialty. As the size of a database grows, recall will 
automatically decrease because of increases in the number of 
refinements necessary to achieve a retrieval set below the 
Meermsetucilityvepedne (blainm and Maron, 1985, pp. 296-297). 
Precision can not be used, as it is unclear if the changes 
in size and topical specialty will affect irrelevant 
retrievals more strongly than relevant retrievals (Doyle, 

Si Sawer, oO 7 ) 

A point of much controversy in using precision and 
recall ratios as measures of retrieval effectiveness is 
their basis on relevance. It is argued that relevance is 
not really a measure because it 1s not additive. For 
example, if documents were weighted on a scale from zero to 
ten based on their relevance, would two documents with 
weights of five and five, be equivalent to two documents 
with weights of eight and two? This controversy can be 
avoided if one strictly states that a document is either 
relevant or irrelevant, and dispenses with the determination 
of the degree of relevance. M. E. Leak and G. Salton define 


four variables that can affect a relevance judgement (Leak 


2/ 


and Salton, 1971, pp. 507-508). These four variables are 
listed below and are illustrated in Figure 10. 
Relevance is affected by: 


‘ "the type of document being judged - including 
its subject matter, level of difficulty, level of 
condensation, style." 


. "the conditions under which judgements must be 
rendered, that is the time available, the order 
of presentation and the size of the document set, 
the type of task specification." 


: "the statement specifying the information 
requirement which determines relevance." 


. "the type of judge used to render the judgements, 
that is his experience, background, attitude and 
so on." 


A comparison of the two databases developed for this 
thesis was made by controlling these variables. Each 
database contained the same documents, although 
representations were different, the difference did not have 
any bearing on their relevance. The conditions under which 
the judgements were made were controlled so as to maintain 
uniformity, the judge making the determination was the same, 
and the requirement of relevance was identical for each 


database. 
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Document Judgment Conditions 


(subject matter, content, (fime ovalloble, order of 
style, level of presentation, definition 
condensotion} of relevance) 


Relevonce 
Judgment 


Iinformotion Requirement Judge 


Statement 
SARNIA (knowledge, Intelligence, 


(content, specificity of experience, attitude, 
Information, textual use orientation, eérror 
attributes) preference) 





Figure 10 Variables of Relevance Judgements. 
(Lesk and Salton, p. 508, 1971) 


ie SEARCHING FEATURES OF FULL TEXT DOCUMENT RETRIEVAL 

There are a variety of different methods that may be 
used when performing a document search, which will provide 
different results. The searching methods that a software 
package provides, will significantly effect its usefulness, 
and should be carefully considered during its evaluation. 
The most notable searching methods are discussed below: 

1. Word Searching 

Searching for a word, is the most elementary form of 


text searching. When you perform a word search, the 


PA, 


retrieval system will locate all the occurrences of that 
word in its document database. Single word searching 
however, is quite general, and will typically result in very 
large sets of retrieved documents. For example, if you were 
interested in color scanners, and you performed a word 
search using "scanner(s)", chances are, you would locate all 
sorts of documents which discuss scanners, but only a few 
might discuss color scanners. To locate the documents you 
desire, would require browsing the retrieved document set, 
which could be very time consuming. 
2. Phrase Searching 

Phrase searching is the capability of searching for 
phrases consisting of consecutive words within a document. 
This capability can reduce the amount of extraneous 
documents retrieved. Searchable phrase lengths range from 
approximately two to four consecutive words, depending on 
the capability of the retrieval software package. 

3. Proximity Searching 

Proximity searching is a mixture of word and phrase 
searching with greater capabilities. Proximity searching 
allows the searcher to specify words or phrases that can be 
searched for within proximity of each other. The proximity 
can be words, lines, paragraphs, or documents, depending on 
the capabilities of the software retrieval package. For 


example, to find information on "how a color scanner works" 
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you might specify a proximity search of "color scanner(s)" 
in the same paragraph as "operation". As with all full text 
searching, it is important to think of how the document may 
have been written, and pick terms accordingly, instead of 
paraphrasing ideas or topics. 
4. Boolean Operators 

Boolean operators provide the searcher with a power- 
ful tool for formulating and refining queries. Most 
retrieval programs provide three boolean operators; AND, OR, 
and AND NOT. These operators can be used in combination 


¢ The boolezn operators AND, 


with words, phrases and sets. 
and AND NOT generally shrink the size of retrieval set, as 
the resulting documents must satisfy both of the search 
criteria. The boolean operator OR, tends to enlarge the 
retrieval set, as a document is included in the retrieval 
set if it satisfies either of the search criteria. 
5. Wildcard Operators 

Wildcard operators can be single or multiple 
character wildcards. Single character operators, take the 
place of one character; multiple character operators, take 
the place of many characters. Wildcard operators are 


particularly useful due to the varying sophistication of 


automatic indexing programs. The procedures used by 


° A set is a group of documents previously retrieved, 


meeting the criteria of a specified search. 
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automatic indexing programs for handling word suffixes 
varies. For example, some automatic indexing software index 
only the roots of words, where as others do not. This means 
that a word search with "farm", conducted on a database 
indexed with a package that indexes only the roots of words, 
will retrieve documents also containing; farms, farmer, 
farmed, and farming. For those packages that do not index 
only word roots, similar results could be obtained using a 
multiple wildcard operator. A search for "farm?" (where ? 
1s a multiple wildcard operator), would retrieve documents 
containing terms with "farm" as the first four letters, such 
as; farm, farms, farmer, farmed, farming, farmyard, 


farmhouse, farmstead, etc. 


F. BROWSING'S ROLE IN DOCUMENT RETRIEVAL 

Browsing is vitally important to a searcher, as it 
allows them to read through information as one would read a 
book. Since this 1S second nature, it provides a familiar 
way to find information or validate information found 
through searching. Browsing, can be thought of as a method 
of document retrieval. Browsing deals with moving from the 
"where" it is, to the "what" is there. A user may pick a 
location in the database (a document or point ina 
document), and then look (browse) to see what is there. 
Searching, deals with traversing from the "what" they want, 


to "where" it is, meaning that the system is told "what to 
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look for", and then provides the user with "documents where 
iets mm@caled. (Z0Cliick, 11987, p. 65) 

BrowSing 1S particularly useful with document databases 
located on CD-ROM. Prior to the availability of CD-ROM 
document databases, searching was conducted on-line. Slow 
data transfer rates and high communication costs, meant the 
searcher did not have the time, or money, to freely browse 
retrieval sets, or the database. A searcher may now connect 
to a CD-ROM document database via his/her personal computer. 
This provides the searcher with fast, easy access for 
unlimited browsing, as well as searching. Thus, vroviding 
the searcher the full utilization of the complimentary 
search, browse features of full text document retrieval. 


(Zoe) liek, 1987, p.. 65) 
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IV. DATABASE DEVELOPMENT 


A. INTRODUCTION 

Documents existing in paper form can be converted into 
digital form as bit mapped images, or a combination of ASCII 
text and bit mapped images. Each form provides advantages 
and disadvantages in the conversion process, as well as the 
electronic retrieval process. To compare the advantages and 
disadvantages for each of these processes, and to determine 
the best overall form, two databases were developed. One 
database contains primarily text, with few images, the other 
primarily images, with little text. 

The equipment used to develop these databases consisted 
of a Zenith 286 personal computer operating at 12 MHZ. The 
computer contains 640 Kilobytes of RAM, two 20 Megabyte 
Winchester Drives, and a 4 Megabyte Kofax 8202 image memory 
board, for image compression and decompression when using 
the attached scanner. Image scanning was performed on a 
Fujitsu model M3094E flatbed scanner with automatic sheet 
feeder. All documents were scanned at a 300 DPI resolution 
with supporting software for image processing and text 


i 


recognition. A Worm Drive by Laser Drive Limited, model 


‘ Calera TrueScan Version 1.071 from Calera 


Recognition Systems, and Irecognize Profession Plus - Page 
Image Processing Software for Image Capture and Text 
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810 with 800 Megabyte cartridges (400 Megabytes per side) 
provided the mass storage capability necessary for database 
development. Wordperfect 5.1, by WordPerfect Corporation 
was used for word processing requirements. The databases 
were developed using KAware Optical Disk Publishing 
software, Version 1.0, by KAWARE Corporation Inc.. Any 
comments made either favorably or unfavorably, about any of 
the above mentioned software products, are the opinions of 
this author and are not intended as a software evaluation. 
The following sections will discuss development of each 
of the databases, tradeoffs associated with that 
development, and advantages and disadvantages of each 


database. 


B. DEVELOPMENT OF TEXT AND IMAGE DATABASES 

Each database contains the same documents, however the 
representation of the documents is different. The 
"text/image" database, contains the full text of each 
document. All figures, graphs, and pages not reliably 
converted to text (1.e. Cover page, Report Documentation 
page, Signature page, etc.) are represented as bit mapped 
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images. The "image/text" database contains images of each 


Recognition, Version 1.07 from iBase Systems Corporation. 


6 For more information on the reliability and accuracy 
of optical character recognition, see Taylor pp. 44-51, 


198 9x 


Ss 


page of each document. The only text in the database, is 
that which is necessary to provide a description of each 
image for indexing and retrieval purposes. Since each 
database contained the same documents, their development 
required many of the same steps. These steps are discussed 
below, with their associated time requirements listed in 
Table 3. 

1. Document Preparation and Scanning 

This process was identical for each database. 
Documents were prepared by removing staples, smoothing bent 
pages, and aligning pages properly in the automatic document 
feeder. During scanning, the iRecognize software creates 
two files; one containing the image of each page in an image 
file format, the other a pointer file. The image file size 
is dependent on the number of pages scanned. Because the 
automatic sheet feeder was used, storage requirements were 
an important consideration, as a thesis containing 100 pages 
would generate an image file of approximately 5.0 Megabytes. 

2. Creation of an Image for each Page. 

This process was required for each database, however 
only pages not reliably converted to text, and portions of 
those pages containing figures and graphs were required for 
the text/image database. The scanned image of each page is 
in the iRecognize image file format, making it necessary to 


produce an image of each page in another format. Images of 
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each page were produced using the .PCX format, although the 
preferred .TIF format was available. The .TIF format was 
preferred as the disk publishing software will only accept 
.TIF format. The .TIF format accepted by the disk 
publishing software however, is not the version exported by 
most image scanners, making it incompatible. It was 
therefore necessary to export the images in a .PCX format 
and then convert them to a useable .TIF format uSing a 
commercially available software package called HIJAAK, 
version 2.01, by Inset Systems Inc.. Once the images were 
in a useable .TIF format, they were conve~ted to the .CPR 
disk publisher format (using the disk puk isher software) 
for later retrieval and display. The times required for 
each of these processes are shown in Table 3. Instances of 
two different values under the text/image header, are the 
times required for a figure, (portion of a page) anda full 
page, respectively. 

The storage requirements to accomplish this step 
were enormous. On average, each page image required 167 
Kilobytes in the .PCX format, 65.82 Kilobytes in the .TIF 
format, and 75.65 Kilobytes in the .CPR (Disk Publisher) 
format. An average size thesis document of 108 pages, would 
require; 18 Megabytes of storage space for the .PCX images, 
7.1 Megabytes for the .TIF images, and 8.2 Megabytes for the 


.CPR images, for a total of 33.3 Megabytes. 
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3. Optical Character Recognition 

OCR was only necessary for the text/image database. 
The process of OCR is dependant on such things as, original 
document quality, font of text used, and proper page 
alignment. All the documents used for this research, were 
of good quality, but contained a variety of fonts. On 
average, a 99 percent or better accuracy rate was obtained, 
which is about 25 character corrections per page. The times 
required for OCR and error correction are shown in Table 3. 

4. Text Preparation and Markup 

At this point, the full text of each document, 
images of all pages of each document, and images of figures, 
graphs, and charts had been generated. It was then 
necessary to link the full text of each document with 
associated images of figures, graphs, and charts to develop 
the text/image database. To complete development of the 
image/text database, terms describing each page image (or 
section of page images) were determined to provide search 
and retrieval capabilities. This process is discussed 
below, with applicable times shown in Table 3. 

a. Text/Image Database 

Full text of each document was brought together 

into one file, where erroneous markings such as file 
headers, extra spaces, and lines were removed. Markers were 


placed in the text to divide each original document 
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(document subdivision will be discussed more in section C, 
Performance and Development Tradeoffs) into smaller 
sections.’ Each section is an individual document from the 
perspective of the retrieval system. Tags and image calls 
were placed in the text to provide access to associated 
images. A tag, is a predetermined unique set of characters 
that differentiate for the indexing system, text to be 
indexed, text used for display purposes only, and markings 
used for image calls. An Image call, is a reference to a 
file name containing the desired image. Image calls were 
placed in the text for each of the figures, graphs, 
diagrams, and pages not converted to text. (KAware Disk 
Publisher User's Manual, 1990, pp. 6(1)-6(19) ) 
b. Image/Text Database 

Text used to describe images of each page, was 
initiated using the text from the abstract and table of 
contents. Key words, phrases, and synonyms for a particular 
page or section were then added to the table of contents of 
each document to provide better retrieval. The 
determination of these key words or phrases, was an 
extremely difficult, time consuming process. It required 
reading each page of the document, and then determining 


words that would best represent the contents. Tags and 


? A marker is a unique character or string of 


characters used by the retrieval system, to divide a 
document into smaller logical documents. 
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markers were then added for indexing purposes, with image 
calls inserted for each page of the document. The markers 
added to this database did not break the original document 
down as far as the text/image database, due to the limited 
text available to index. 
5. Automatic Index Generation 

The time required to automatically generate the 
index, is directly related to the amount of text being 
indexed. Index generation time varied, due to the 
differences in the amount of text being indexed. Index 


generation time for each database is shown in Table 3. 


TABLE 3. TIME REQUIREMENTS FOR DATABASE DEVELOPMENT 


Text/Image Image/Text 


Document Preparation 5.0 min 5 0mman 
Scanning Time/Page 13 0S ec 135.0. See 
-PCX Image Generation 2050/6000 see 60.0 sec 
~PCX to .TIF Conversion 45.0/90.0 sec 90.0 sec 
-eTIF to -CPR Conversion 35.0 sec 35.0 sec 


OCR/ Page 65.0 sec N/A 


OCR Error) Gorrect ion, Page 4. 08min N/A 


Text Correction, Index, 
Tagging and Image Calls 90.0 min 3.0 min/page 


Automatic Index 
Generation 5.0 min 45.0 sec 
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6. Thesis Conversion Time 
Using the assumption that an average thesis is 108.3 
pages long, containing 25 percent figures and graphs, the 
time to process a hard copy thesis document into an 
electronically retrievable format can be calculated (Taylor, 
Eeeeeeoeo)= Tables 4 and’ 5 illustrate these 


calculations for a image/text and text/image database. 
TABLE 4. IMAGE/TEXT DATABASE CALCULATIONS 


Document Preparation: 


Scanning Time: 
(108.3 pages x 13.0 sec/page) + 60 sec/min = 


Image Generation: 
108.3 pages * (60 + 90 + 35) sec) + 
60 sec/min = 


Text Preparation & Markup: 


3.0 min/page * 108.3 pages = 


Automatic Index Generation: 30.0 


6G8/.7 
Or yo 11% 5 


Note: If the process of .PCX to .TIF image conversion 


is avoided, 2.7 hrs are saved, yielding 
a time of 8.8 hrs. 





If the conversion of images from the .PCX to .TIF 


format was not necessary, the total time required to convert 


ve "Selecting 20 thesis documents at random, the aver- 


age size of a thesis was 108.3 pages in length, of which 
27.9 pages were graphs or charts and 7.4 pages were 
Puleceuines. (Taylor, p. 52, 1989) 
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a thesis (in paper form) into a retrievable digital form, is 
approximately the same for either text/image or image/text 
format. The time gained by not performing OCR and 
correction for the image/text database, was lost determining 
proper terms to index the page images on. 

This reveals a significant point. When discussing 
the time requirements for conversion of paper documents to 
digital form, it is necessary to consider the combined 
process of, both converting that document to digital forn, 
and proper markup and indexing of that document for later 
retrieval. As evidenced here, the gains made in eleseaentic 
conversion, using an imaged based database, were off set by 
difficulties encountered during markup and indexing, making 
the overall time for each database method approximately 


equal. 
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TABLE 5. TEXT/IMAGE DATABASE CALCULATIONS 


| Document Preparation: 


Scanning Time: 
(108.3 pages x 13.0 sec/page) + 60 sec/min = 


| Image Generation: 
\ (279 | tp eveeeeepages * (30 + 45 + 35) sec) + 

| 60 sec/min = 64-7. man 
(a> pages * (60 + 90 + 35) sec) + 60 sec/min = 95.2 mn 


OCR: 
70# pages * (65 sec + 4.0 min) = 255), Oven 


| Text Preparation & Markup: 90.0 min 


| Automatic Index Generation: 1.5 min 


549.7 min 
Or 9.2 hrs 


Note: If the process of .PCX to .TIF image conversion 
is avoided, 0.5 hrs are saved, yielding a 


time of 8.7 hrs. 


Pages containing Charts or Graphs 

Pages containing Pictures 

Full Pages: Cover, Document, and Signature Pages 
108.3 - (27.9 + 7.4 + 3) = 70 
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cr PERFORMANCE AND DEVELOPMENT TRADEOFFS 

Many of the decisions made during database development 
affected performance. The following sections will discuss 
the development decisions made, the impact these decisions 
will have on performance, and the actual performance of the 
two databases. 

1. Development Decisions and Tradeoffs. 

One of the biggest decisions during database 
development was whether or not to divide documents into 
smaller documents for retrieval, and if so, how small. 
Subdivision increases search precision, but requires more 
search time. To ensure that a searcher can always determine 
the author of a document, an image call for the cover page 
(or applicable title page) must be placed in each logical 
document. This means that each level of document 
subdivision (chapters, sections, subsections etc.) requires 
additional time to allow for insertion of the proper markers 
and image calls. When doing a search, the system locates 
the logical document that contains the word(s) you are 
searching for. A document 100 pages long not subdivided, is 
considered one logical document for retrieval. A term or 
phrase located somewhere in 100 pages, is not as precise as 
locating that same term(s) in a smaller section of possibly 
a few pages or less. The effects of subdivision, are most 


Significant when using boolean operators. Without 
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subdivision, the entire document is considered the boolean 
domain for the search. Proximity searching however, is 
unaffected by logical document size, as the search is 
conducted by finding words within a specified range of each 
other. To determine the proper balance between increased 
precision, and time required for text markup, queries were 
conducted on each database under various subdivision 
conditions. These will be discussed more in "Database 
Performance". 

An additional problem, only affecting the image/text 
Gata base was, the association of words, phrases and or 
synonyms with the page images for proper indexing and 
retrieval. The under-lying problem with document retrieval, 
continues to be, that the searcher does not know the "actual 
words" the author(s) uses when writing about a particular 
topic, and a particular document may pertain to a number of 
different topics. It is important to determine the 
appropriate words to associate with a page image (or group 
of page images) to best aid the "typical" searcher in 
finding that information. This author concluded, "there is 
no magical solution." Different people use different words 
to describe (index) the images, just as different people use 
different search terms when looking for information. A 
related problem, is the best number of words to associate 


with the page image. As the number of words associated with 
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each image increases; the time required to determine these 
terms increases, but the likelihood of retrieving that page 
also increases (recall will increase). Precision, however, 
will decrease, as more of the image pages will have 
identical terms associated with them. This is illustrated 
by Figure 11, which shows how recall and precision ratios 
change as a function of the number of index terms per 
document (Doyle, 1975, p. 338). This author compromised 
between the time requirements necessary to determine the 
index terms, and the associated recall/precision, by 


associating approximately five terms with each image page. 


x 30 terms 


x 20 terms 
Sy 10 terms 


Recall ratio, % 





0 50 100 
Precision ratio, % 


Figure 11 Effects of Index 
Term number on Precision and 
Recall. (Doyle, ». 338, 1975) 
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2. Database Performance 
To determine the advantages and disadvantages of 
each database, and properly ascertain the best level of 
document subdivision, a set of queries was performed on each 
database. This set of queries was performed on the 
text/image database, under four different markup conditions 
(1 - 4), and on the image/text database under three 
different conditions (2 - 4).'' These different markup 
conditions are as follows: 
¢ Condition One: Each document was subdivided into the 
smallest logical section. This ranged from section to 
subsubsection depending on the document. 


* Condition Two: Each document was subdivided by sections 
of a chapter. 


* Condition Three: Each document was subdivided into 
chapters. 


¢ Condition Four: Documents were not subdivided. 


To provide a better understanding of the 
conclusions, it is important to have a general understanding 
of the features of the KAware retrieval system. The system 
provides two modes of retrieval display; "display retrieved 
documents" or "display all documents." If the system is set 
to "display all documents", the whole file is available for 


browsing after a search has located matching documents. 


"Pocument subdivision below the section level, was 


impractical due to the limited amount of text on which to 
index. 
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When browsing, the user can jump from term to term (in or 
among documents), or from document to document throughout 
the entire file. If the system is set to "display retrieved 
documents", then only those logical documents, located by 
the search query, are available for browsing using the same 
capabilities listed above. All search capabilities 
discussed in Chapter III are available. 
a. Level of Document Subdivision 

Division of documents by section provided the 
best overall performance considering: ease of browsing; 
precision; recall; and the time required to insert image 
calls, and perform text markup. This conclusion is based on 
the following: 

+ Document breakdown below the section level did not 
Significantly increase precision, and had no effect on 
recall. 

¢ Document breakdown below the section level increased 
Significantly the time required to insert image calls, 
and perform text markup. 

¢ BrowSing documents broken down below the section level, 
was not appreciably easier or less time consuming, then 
browsing documents broken down at the section level. 

¢ Document breakdown at the chapter level was not 
conducive to easy browsing of the document. Viewing 
different sections required scrolling through pages of 
material, in lieu of simply jumping from section to 
section, as in section breakdown. 

¢ Boolean searches conducted with documents broken down by 
chapters, yielded a larger number of irrelevant 


documents (lower precision) then documents broken down 
by sections, aS associated terms were found pages apart. 
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- Retaining each document without breakdown, made browsing 
more time consuming then chapter breakdown, due to the 
size of each logical document. 

¢ Boolean searches conducted with documents not broken 
down, resulted in many irrelevant documents, as associ- 
ated terms were found pages or even chapters apart. 

As was expected document breakdown did not affect 
proximity searches, as retrieval performance for the two 
databases was consistent over all markup conditions using 
proximity searching. 

b. Retrieval Advantages and Disadvantages 

From the user's view point, it was much easier to 
obtain data from the text/image database, then the 
image/text database. This was as a result of, being able to 
read what was located "directly", and an increased 
capability for finding desired information, by searching the 
full text of each document. Terms located in the text/image 
database, are in the actual full text of the document. When 
browsing for relevance, the user is reviewing the actual 
text of the document. Terms located in the image/text 
database are found among terms used to index the images. To 
determine relevance, the user must scroll through a series 


of images. Each database had its own advantages and 


disadvantages, which are listed below. 
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(1) Text/Image Advantages and Disadvantages 


Advantages: 


Having the full text of each document to index and 
search on made browsing the document much easier. 


Provided higher precision and recall for the same level 
of document breakdown than the image/text database. 


Provided direct access to information as text is 
searched and viewed directly, in lieu of searching an 
index, and then viewing page image(s) associated with 
that index, to determine relevance. For example, when a 
word or phrase is located in a full text document 
database, the system displays the appropriate section(s) 
of the document, with the word or phrase highlighted. 
This immediately draws the user's attention to that 
section of the document. That same word or phrase would 
be located in the index of a image based document 
database. The user would then have to view the page 
image(s) associated with that index word or phrase to 
determine relevance. 


Images of figures, charts, or graphs could be printed if 
desired without added printer memory. 


Disadvantage: 


Full page images associated with the database, such as 
the cover, document, and Signature pages, could not be 
printed without increasing standard printer memory, due 
to their large size (75.67 Kilobytes average). 


(2) Image/Text Advantages and Disadvantages 


Advantages: 


Images of each page provided actual layouts of each 
page, preserving all bolding, underlines, italics, and 
line spacing, making the text structurally easier to 
read. 
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* The majority of information on the page images, was 
located using only the terms found in the Table of 
Contents, implying that much of the time spent 
developing additional terms on which to index could have 
been saved. 


Disadvantages: 


- Access to information was indirect, as images were 
viewed to determine relevance, in lieu of reading the 
Heoxe directly? 


¢ Enormous storage requirements. Image/text database 
required 6.7 times more storage space than the 


text/image database. 

- Images of full pages could not be printed with normal 
printer memory due to their large size (75.67 Kilobytes 
average). 

A problem common to both databases, resulting for 
the KAware Disk Publisher, is the display of images. Images 
(pages, figures, charts, or graphs) can be displayed at four 
different zoom levels, however, only the two greatest zoom 
levels were effective for reading text from a full page. At 
that level of zoom, a full line of text could not be read 


without scrolling, which made reading a page very tedious. 


D. SUMMARY 

The paper to digital conversion of a document required 
approximately equal times for both the full text 
(text/image) and the image based (image/text) forms. 
Analysis showed that time during the text preparation and 


markup may be saved, due to the fact that most of the 
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information was located using the terms from the Table of 
Contents (image/text database). This would result in a 
shorter conversion time for the image/text database, then 
the text/image database. This time savings however, is out 
weighed by, ease of browsing, direct access to information, 
and increased precision and recall of the text/image 
database. Any conversion of paper documents should use the 


full text approach. 
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V. SUBMISSION OF THESIS DOCUMENTS IN ELECTRONIC FORM 


A. INTRODUCTION 

Proliferation of the personal computer and word 
processing software implies that documents exist in 
electronic form at their completion. Documents in 
electronic form can be stored on digital media, such as 
optical disks, with less effort than converting paper 
documents to electronic form for storage and retrieval. The 
Naval Postgraduate School receives approximately 230 thesis 
documents per quarter. On average, approximately 12 copies 
of each of these documents are reproduced for distribution. 
Reproduction and mailing costs alone, are estimated to be 
over $26,000.'* Since these documents exist in electronic 
form at their completion, storage and reproduction of an 
electronic version would eliminate the need for converting 
future paper thesis documents to digital form. 

Turning in completed thesis documents in electronic 
form, does require some special requirements and decisions. 
For instance, in what electronic form (WordPerfect 5.1, 


ASCII text, Microsoft word, etc.) should they be turned in? 


'@ Estimate is based on an average size thesis (108 


pages) costing approximately $2.90/thesis to mail. Mailing 
costs are paid from a centralized BUPERS fund, reproductions 
costs are paid by the NPS. 
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What special requirements exist for graphics, and pages 
(Cover, Document and Signature pages) generated on Xerox 
work stations? What word processing software is currently 
available on campus for students to produce their thesis? 
The following sections will discuss the current thesis 
preparation tools available on campus, special requirements 
for turning in thesis documents in electronic form, as well 
as advantages and disadvantages of documents in WordPerfect 


5.1 and ASCII text formats. 


B. CURRENT THESIS PREPARATION TOOLS 
Students can produce their thesis one of five ways: 
* Use their own personal computer and word processing 
software. 
* Hire a thesis typist to prepare their thesis. 


* Use a software package called G-Thesis available on the 
Naval Postgraduate School mainframe. 


¢ Computer Science department students can use a software 
package called Framemaker, available on Sun work 
stations in their laboratories. 

* Use a software package called WordPerfect (current 
version 5.1), which is available in all micro labs with 
word processing capability, located on the Naval 
Postgraduate School campus. 

A number of different word processing software packages 
are available on the market, although the most widely used 


is, WordPerfect, by WordPerfect Corporation. It is the 


administrative computing standard on the Naval Postgraduate 
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School campus, and is used by most thesis typists and 
students. For this reason, the discussion of thesis 
preparation tools will be limited to; G-thesis, Framemaker 
and WordPerfect. 
1. Thesis Preparation using G-Thesis 

G-Thesis 1s a word processing software package, 
available on the Naval Postgraduate School mainframe. Its 
capabilities are cumbersome, when compared with software 
packages available for a personal computer. A formal survey 
was not done, however, Naval Postgraduate School Computer 
Center personnel estimate that, G-Thesis is used by few 
students each quarter. Script, the prog: .m used by G- 
thesis, continues to be offered as it uses little disk 
storage space and processing time on the NPS mainframe. 
Script costs very little to buy and maintain as it is one of 
software packages obtained by the Naval Postgraduate School 
under the Higher Education Software Consortium.'? G-thesis 
can export documents in an ASCII text format, however, there 
1S no way to preserve the graphics, footnotes, endnotes, or 


page numbers created by the software. 


'5 The Higher Education Software Consortium is a 


program from IBM, where qualifying institutions can obtain 
as many software packages as they desire (froma 
predetermined list), for one price. 
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2. Thesis Preparation using Framemaker 

Framemaker, 1S a capable word processing software 
package available on the Sun Work stations in the Computer 
Science Department. Documents created with Framemaker are 
in a Post Script format, but can also be exported in an 
ASCII text format. Graphics, footnotes, endnotes, and page 
numbers created by the software are not preserved when the 
document is exported in ASCII text format. Graphics 
contained in the document can be preserved by saving them 
separately, as images, before they are imported into the 
Framemaker document. 

Enrollment in the Computer Science curriculum is 
approximately 20 to 25 students per class (every other 
quarter), which is a small number of students, when compared 
to an entire graduating class, numbering 230. 

3. Thesis Preparation using WordPerfect 5.1 

WordPerfect, by WordPerfect Corporation, is the 
administrative computing standard for word processing 
software on the Naval Postgraduate School campus. Every 
micro computer laboratory at Naval Postgraduate School, that 
has word processing software installed, uses WordPerfect 
5.1.'* In addition to being widely available, all students 


in the Administrative Science Department take a course (IS- 


4 WordPerfect 5.0 and Microsoft Word are also 


available, but only in laboratories where WordPerfect 5.1 is 
also available. 
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0123) in WordPerfect 5.1. Thesis styles for WordPerfect 5.1 
are available from the Naval Postgraduate School Computer 
Center. These styles will correctly format a document to 
meet the requirements for thesis completion. Thesis styles 
are easy to use, and require little experience with 
WordPerfect. Instruction is given throughout each quarter 
to students wishing to use these styles, to provide basic 
instruction, and answer guestions. Beta testing is current- 
ly being conducted by Naval Postgraduate School Computer 
Center personnel, on software by WordPerfect Corporation, 
that would allow the mainframe's high speed laser printer, 
to print WordPerfect documents created on micro computers. 
WordPerfect 5.1 is a capable word processing 
software package. Like other word processing software 
packages, graphics contained within a WordPerfect document 
are not exportable in an ASCII text format; however, 
footnotes, endnotes, and page numbers created within the 
document, can be retained, using WordPerfect's Standard 
Printer driver, by sending the output to a file, instead of 


the printer. 


C. FORMAT REQUIREMENTS FOR ELECTRONIC DOCUMENTS 

A document submitted in electronic form must contain all 
aspects of the same document in paper form. This means that 
all pages of a thesis (from Cover page to Distribution 


list), must be included in an electronic version. A thesis 


ony 


is composed of: Cover, Document (DD-Form 1473), and 
Signature pages; List of Tables/Figures, and Table of 
Contents; Document body containing text and graphics; 
Appendixes, List of References, Bibliography, and Initial 
Distribution List. The Cover, Document and Signature pages 
of each thesis can be produced on Xerox work stations. These 
pages would be submitted as images to provide author 
accreditation (Cover Page), preserve Document structure 
(Document Page), and provide thesis authenticity (signed 
Signature page). The remaining parts of a thesis are 
composed of text (and graphics) that can be created with a 
word processing software package. Graphics contained within 
a document created by a word processing software package 
must be saved as separate images, because once imported, the 
graphics become part of the document, and can not be 
exported as ASCII text. Regardless of the word processing 
software package, the final format of an electronic document 
must be compatible with the Optical Disk Publishing Software 
being used. The Optical Disk Publishing Software used for 
this thesis; KAware Optical Disk Publishing Software, 
Knowledge Access International, was chosen because it 
provided many of the capabilities available in similar 


software packages, at a fraction of the cost. 
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KAware's optical disk publishing software requires that: 


- Text of documents being indexed must be in ASCII text 
format and oy contain ASCII characters between 32 and 
Pomucecimal) . 

- Images of graphics must be in .TIF format with one image 
per file. 

Word processing software packages will export documents 
in ASCII text format; however, many will not preserve items 
produced using hidden codes, such as page numbers, 
footnotes, endnotes, and graphics. Since these items are 
part of the document, they must be preserved. Graphics can 
be preserved as separate images using the graphics program 
which created them, or by using an optical scanner. 

The requirement of having images of graphics, in 
addition to, the text of the document, raises a potential 
problem for many students at Naval Postgraduate School. 

Most graphics contained in thesis documents, are a result of 
cutting, pasting, and photocopying of pages for inclusion in 
the thesis. For most students, the production of images of 
graphics, will require them to optically scan these 
graphics, to produce bit mapped images. Most students have 
never used an optical scanner, thus requiring additional 


time when producing their thesis. Two optical scanners are 


' Documents may be in any word processor format while 


performing markup and inserting image calls. Prior to 
indexing with the disk publisher software however, the text 
must be converted to ASCII text. 
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currently available in the micro computer lab, in Ingersoll 
I-151. The "Discover" scanner, will produce a .TIF image 
that is compatible with the KAware disk publishing software. 
The Xerox scanner, will produce a .TIF image, however it is 
not directly compatible with the disk publisher software. 
The Computer Center has purchased, and is currently evalu- 
ating (for placement in micro labs), a software package 
Called HIJAAK, that would allow .TIF images produced by the 
Xerox scanner, to be converted to a .TIF format, acceptable 


‘6 KAware Corporation 


by the disk publishing software. 
Inc., is currently working to improve the image 


compatibility of their software. 


D. CONVERSION OF A DOCUMENT IN WORDPERFECT FORMAT 

To ascertain the effort required to transform an 
electronic document in WordPerfect format, to a retrievable, 
searchable document, an experiment was conducted, ona 
document in WordPerfect format. This document was created 
with WordPerfect 5.1, using NPS thesis styles. This thesis 
is 245 pages and contains 31 figures or graphs. Each figure 
or graph in the document, was in a separate image file in 
-.WPG (WordPerfect Graphics) format. A separate analysis of 


a WordPerfect document, created without thesis styles was 


‘6 HIJAAK, version 2.02, by Inset Systems Inc., is a 


commercially available software package that will convert 
images from one format to another. 
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not conducted, however, the author does not believe it would 
take substantially longer to markup such a document due to 
the search and replace features of WordPerfect 5.1. 

Styles used to create a document can be easily modified. 
Modifications made, automatically reformat affected text. A 
set of styles can be modified, saved, and then retrieved 
into a document, replacing the same styles used to create 
the document. This would preclude making the same changes 
to styles of each document. Inserting markings for document 
subdivision, and image calls, for proper author 
accreditation at section headings, was easily accomplished 
by modifying the document styles. Style .odifications were 
also used to change document margins. When a document is 
converted to ASCII text, left margins are filled in as 
spaces. Top and bottom margins produced significant blank 
areas in the text. Margin modification alleviated these 
problems. Search features of WordPerfect, reduced the 
effort required to insert image calls, for graphics 
contained in the document. Graphics in the document, were 
located by searching for hidden codes. Proper markings and 
image calls were then inserted, by retrieving an existing 
file, containing this information. Graphics were then 
deleted from the WordPerfect document, to prevent large 
blank areas among the text when the document was converted 


to ASCII text. Upon completion of the markup process, the 
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document was converted to ASCII text, using WordPerfect's 
standard printer driver. Time required to perform this 

markup process, convert images to the .CPR disk publisher 
format, and index the document for subsequent search and 


retrieval, is shown in Table 6. 


TABLE 6. CONVERSION OF A DOCUMENT IN WORDPERFECT FORMAT 


Image Generation: 
((31 graphic images) * (45 + 35) sec) = 
60 sec/min = 40 30 


(3* pages * (90 + 35) sec) + 60 sec/min = 6.25 min 


Text Preparation & Markup: 35 200" ma n 


| Automatic Index Generation: 1.50 main 


84.05 min 
or 1. 4a S 
Note: If the student performs image conversion to 
the proper .TIF format, 0.46 hours are saved, 
yielding a time of 1.04 hrs. 





Full Pages: Cover, Document, and Signature Pages 


It should be noted that, the thesis used for this 
experiment, has over twice as many pages aS an average 
thesis document. The number of graphics however, is 


approximately equal. 
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E. CONVERSION OF A DOCUMENT IN ASCII TEXT FORMAT 

To perform this portion of the experiment, the 
WordPerfect document used above, was converted to an ASCII 
text document, using the standard printer driver of 
WordPerfect 5.1. 

The process of inserting markings for document 
subdivision, and image calls for graphic image display, was 
a tedious process. Without the ability to perform search 
and replace functions, on hidden codes found at section 
headings and graphic locations, the entire thesis had to be 
reviewed page by page. When image calls and markings were 
inserted in the text, reformatting was required, adding 
additional time to the process. Time required to perform 
this markup process, convert images to the .CPR disk 
publisher format, and index the document for subsequent 
search and retrieval, is shown in Table 7. Conversion of 
the ASCII text document, took almost twice as long as the 


same WordPerfect document. 
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TABLE 7. CONVERSION OF A DOCUMENT IN ASCII TEXT FORMAT 


| ((31 graphic images) * (45 + 35) sec) + 
60 sec/min = 41.30 min 


(3* pages * (90 + 35) sec) + 60 sec/min = 6.25 min 


Text Preparation & Markup: 90.00 min 


Automatic Index Generation: 1.50 min 


139. 05) man 
or 2.3 hrs 
Note: If the student performs image conversion to 
the proper .TIF format, 0.46 hours are saved, 
yielding a time of 1.84 hrs. 


Full Pages: Cover, Document, and Signature Pages 





F. SUMMARY 

WordPerfect 5.1 is a widely accepted word processing 
software package, available to students in virtually every 
micro computer laboratory on the Naval Postgraduate School 
campus. Its ease of use, and many capabilities, save 
substantial time during the transformation of electronic 
documents for storage and retrieval on optical disks. These 
factors make WordPerfect the logical choice for the 
electronic document standard. 

The number of graphics contained in an electronic 
document, significantly affects the length of time it takes 
to convert that document to a disc publisher format. Each 


graphic must be converted to disk publisher image format, 
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and must have an image call placed in the document text to 
display it; these are both labor intensive processes, as 
shown in Tables 6 and 7. 

Graphics, contained in electronic documents, must 
accompany the text, as images. This will require additional 
effort by the students, and support by the school, to 
provide training in the use of optical scanners and in the 


graphics capabilities of WordPerfect. 
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VI. CONCLUSIONS, RECOMMENDATIONS 


A. CONCLUSIONS 

Conversion of paper documents, for storage and retrieval 
on optical disks, is an expensive, time consuming process. 
Conversion and storage of electronic documents is more 
efficient, requiring less than one-eight of the time. 
Conversion and storage of a document, as a collection of 
images, is not more efficient than full text conversion and 
storage. This is due to the large time requirement 
necessary, to produce an index, which provides search and 
retrieval capability, for the collection of images. 

Full text document databases, provide better search 
Capabilities, than image based document databases. The 
ability to search on the text contained in a document, 
provides superior flexibility, to accommodate the wide range 
of searcher experience levels, and thought processes. 

Submission of thesis documents in electronic form, is 
the best solution for long term future storage and 
distribution of these documents. The electronic form used 
for document submission should be WordPerfect (5.1), since 
it is readily available, and yields the greatest benefits 
during the conversion process. New versions of WordPerfect 


(1f produced by WordPerfect Corp., and adopted by NPS) would 
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not cause Significant problems. A transition period could 
be established where either version 5.1 or the later version 
would be accepted. WordPerfect documents created using 
version 5.1 would be imported and converted to the new 
version as WordPerfect's software has always been backwards 
compatible. Submission of thesis documents in electronic 
form, will require additional work by the student, but will 
provide efficient use of technology. Electronic documents 
are easily converted for storage and retrieval on optical 
disks. Optical disks are inexpensive to mail, and require 
little storage space. Conversion and stc~age of electronic 
thesis documents directly, will reduce re ,uirements for 
storage, and digital conversion of paper thesis documents in 
the future. 

Current optical scanning capabilities, will not support 
the requirement, to provide images of all graphics contained 
in the document, as well as images of cover, document, and 
Signature pages of a thesis document. Students will require 
access to software that is capable of converting images to a 


format compatible with the optical disk publishing software. 


B. RECOMMENDATIONS 

Due to the extreme differences between converting paper 
and digital documents, this researcher does not recommend 
conversion of paper documents. Instead, a program should be 


instituted, where students submit one paper, and one 
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electronic copy of their thesis. The electronic copy, would 
be converted and stored on a CD-ROM optical disk with other 
theses, for storage and distribution. The paper copy would 
be mailed to the Defense Technical Information Center for 
processing and storage. The Defense Technical Information 
Center would prefer to receive documents electronically; 
however, no program iS currently underway to accomplish this 
until the Department of Defense establishes a standard for 
Electronic Data Interchange (EDI), under the Corporate 
Information Management (CIM) initiative (Viel, 1992). 

This researcher recommends, that the HIJAAK software 
purchased by the NPS computer center, be made available in 
the micro computer labs on campus. This will provide 
students the ability to submit images ina .TIF format that 
is compatible with the disk publishing software. 

The WordPerfect course taught to students in the 
Administrative Science Department should be broadened, to 
include training on the graphics capabilities of WordPerfect 
5.1. A similiar course should be established in other 
departments to provide instruction to their students. This 
will decrease the number of students required to learn 
WordPerfect on their own. 

To provide the optical scanning capabilities necessary 
for electronic document submission, additional scanners 


should be purchased for use in the Learning Resource Centers 
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(LRCs) and micro computer laboratories on campus. Clear, 
concise instructions should be furnished to students, 
describing basic operation. Instruction given on 
WordPerfect thesis styles, should be broadened to include 
instruction on the optical scanners located in Ingersoll I- 
151, as well as the use of the HIJAAK software mentioned 
earlier. 

A project for a follow on thesis, would be to establish 
a system, for mastering CD-ROM optical disks, containing 


completed thesis documents. 
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